Bike Station GNN

Questions to answer:

  1. 25 most important stations overall - bikes move through the graph and need to be available at any station for a ride, so Citi Bike needs to keep the most important stations stocked
  2. 25 most important stations on bad weather days - access to bikes matters most on bad weather days. If these stations differ from the overall list, rider behavior changes on bad weather days
  3. 25 least important stations on bad weather days - these stations could

Vertices and Edges

Steps

  1. load the master data - startStationName, endStationName, startStationLatitude, endStationLatitude, startStationLongitude, endStationLongitude, tripduration, time_bin, peak_commute, dow, precip, feels_like - DONE
  2. add the good/bad weather days - DONE
  3. add the locationGroup ??
  4. add the bikeBehaviorGroup - DONE
  5. add crow flies distance
  6. Save the dataframe so we don't have to build it again
  7. Build the graph
    1. vertices
    2. edges
  8. save the variables in a parquet that I can load into databricks for a map visualization https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html
  9. do a page rank to determine the 25 most important and 25 least important stations
  10. filter on bad weather days
    1. save the bad weather parquet for databricks visualization
    2. do page rank on most and least important stations on bad weather days
    3. do these match to the badWeatherDurationIndex at all? == comparison of another type of grouping model
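
Step 7 above (build the graph) can be sketched in plain Python as follows. This is only an illustration under assumptions: the column names come from the master data in step 1, but the helper names (`build_graph`, `make_trip`), the placeholder stations, and the choice to collapse repeated trips into one weighted edge are all hypothetical. The notebook's own graph code may keep one edge per trip instead.

```python
from collections import Counter

def build_graph(trips):
    """Return (vertices, edges) for a directed station-to-station graph.

    vertices: {station name: (latitude, longitude)}
    edges:    [(start station, end station, trip count)]
    """
    vertices = {}
    edge_counts = Counter()
    for t in trips:
        src, dst = t["startStationName"], t["endStationName"]
        vertices[src] = (t["startStationLatitude"], t["startStationLongitude"])
        vertices[dst] = (t["endStationLatitude"], t["endStationLongitude"])
        edge_counts[(src, dst)] += 1  # assumption: one weighted edge per station pair
    edges = [(src, dst, n) for (src, dst), n in sorted(edge_counts.items())]
    return vertices, edges

# Placeholder stations and coordinates, not real Citi Bike data.
POS = {"Station A": (40.70, -74.00), "Station B": (40.71, -73.99)}

def make_trip(src, dst):
    return {
        "startStationName": src, "endStationName": dst,
        "startStationLatitude": POS[src][0], "startStationLongitude": POS[src][1],
        "endStationLatitude": POS[dst][0], "endStationLongitude": POS[dst][1],
    }

trips = [make_trip("Station A", "Station B"),
         make_trip("Station A", "Station B"),
         make_trip("Station B", "Station A")]
vertices, edges = build_graph(trips)
```

Here the two identical A-to-B trips become a single edge with weight 2, which keeps the edge list much smaller than the trip list.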

STEP 1: Load Master Data

DON'T RUN: are there any stations in startStation that aren't in endStation?

Answer: no

STEP 2: Load Good/Bad weather label for each trip

STEP 3: ADD Location Grouping

STEP 4: ADD BikeBehavior Group

STEP 5: Add Crow Flies Distance
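
The notes do not say which formula is used for the crow-flies distance. A common choice is the haversine great-circle distance, sketched below as an assumption. The function name `crow_flies_km` and the kilometre unit are illustrative; the inputs correspond to the start and end latitude and longitude columns loaded in step 1.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def crow_flies_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A round trip that starts and ends at the same station gets a distance of 0, so such trips may need separate handling if distance is used as a feature.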

Are there any nulls, NaN's, or blanks we need to deal with?
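
One way to answer this question is to count missing values per column. The sketch below is a plain-Python illustration that treats `None`, NaN, and empty or whitespace-only strings as missing. The helper names and the sample rows are hypothetical.

```python
import math

def is_missing(value):
    """True for None, float NaN, or a blank string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

def count_missing(rows):
    """Return {column: number of missing values} over a list of dict rows."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + int(is_missing(value))
    return counts

# Hypothetical rows using column names from the master data.
rows = [
    {"startStationName": "Station A", "tripduration": 600.0, "precip": 0.0},
    {"startStationName": "  ", "tripduration": float("nan"), "precip": None},
]
missing = count_missing(rows)
```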

STEP 6: Save the dataframe so we don't need to build again

START HERE IF YOU WANT TO USE THE ALREADY CREATED gnnData csv

TO TRY: Neighborhood to Neighborhood (instead of station to station)

Create a sample of the gnn dataframe for testing the GNN code

START HERE IF YOU WANT TO USE ALREADY SAVED 50% sample VERTS AND EDGES

Even with 12 cores and 120GB RAM I am unable to run page rank on the full verts and edges.

So, I am using a 50% sample from the total dataset. This sample contains 24,784,908 trips from 1621 stations.
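
A reproducible 50% sample can be drawn by keeping each trip independently with probability 0.5 and a fixed seed. The sketch below is a plain-Python illustration; the function name `sample_trips` and the seed value are assumptions, and the notebook's own sampling call may differ.

```python
import random

def sample_trips(trips, fraction=0.5, seed=42):
    """Keep each trip independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    return [t for t in trips if rng.random() < fraction]

# Stand-in for the trip records: integers instead of full rows.
all_trips = list(range(10_000))
sample = sample_trips(all_trips)
```

This kind of sampling returns approximately, not exactly, half of the rows, and stations with very few trips can drop out of the sample entirely.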

Most important end stations

The Page Rank algorithm weighs the incoming edges to a vertex and transforms them into a score.

The idea is that each incoming edge represents an endorsement and makes the vertex more relevant in the given graph.

For example, in a social network, if a person is followed by various people, he or she will be ranked highly.
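
The idea above can be sketched as a small power iteration in plain Python. This is an illustration under assumptions, not the implementation the notebook runs: the damping factor of 0.85, the fixed iteration count, the weighting of edges by trip count, and the scaling (scores average 1 across stations, chosen because the scores reported in these notes are on that order) are all choices made for the sketch.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    edges: list of (src, dst, weight) tuples.
    Returns {node: score}; scores sum to the number of nodes.
    """
    nodes = sorted({n for s, d, _ in edges for n in (s, d)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w
    n = len(nodes)
    rank = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        new_rank = {v: (1.0 - damping) + damping * dangling / n for v in nodes}
        # Each node passes its rank along outgoing edges in proportion to weight.
        for s, d, w in edges:
            new_rank[d] += damping * rank[s] * w / out_weight[s]
        rank = new_rank
    return rank

# Hypothetical graph: A and B both send trips to C, and C sends trips to A.
ranks = pagerank([("A", "C", 1), ("B", "C", 1), ("C", "A", 1)])
```

In this toy graph C receives trips from two stations and ranks highest, A receives C's rank and comes second, and B has no incoming trips and keeps only the baseline score.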

plotting with size by importance

plot the most important

If the page rank has already been created you can load the file in the next cell:

Create a pandas dataframe results of the 50% sample page rank

Show Box plot of station importance

Create a pandas dataframe with subway stations

Observations: I was expecting that the highest ranked stations outside of Lower Manhattan would all be associated with a subway station, but that is not the case. You can investigate this by zooming in on the plot.

plot only with pagerank > 3 sized

plot only with pageRank < 0.5 sized

Get the rank for good weather trips

Get rank for bad weather trips

Findings on Good Weather vs. Bad Weather Station Importance

Good Weather:

  1. 45 stations had a pagerank greater than 3
    1. 11 of the most important (MI) Good Weather stations were not MI during Bad Weather
    2. Good Weather max page rank was 5.73
    3. Top 5 stations were:
      1. Front St & Washington St
      2. 1 Ave & E 68 St
      3. E 17 St & Broadway
      4. West St & Chambers St
      5. 134 Kent Ave & N 7 St

Bad Weather:

  1. 41 stations had a pagerank greater than 3.
    1. 7 of the MI Bad Weather stations were not MI in the Good Weather data set.
    2. Bad Weather max page rank was 5.498
    3. Top 5 stations were:
      1. 1 Ave & E 68 St
      2. Front St & Washington St
      3. Pershing Square North
      4. E 17 St & Broadway
      5. W 21 St & 6 Ave

Determine unique to Good Weather and Bad Weather
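
Finding the stations unique to each condition comes down to set differences over the stations above the page rank threshold of 3. The sketch below uses a hypothetical helper, `compare_important`, and made-up scores for placeholder stations.

```python
def compare_important(ranks_a, ranks_b, threshold=3.0):
    """Split stations with rank > threshold into only-A, only-B, and shared sets."""
    important_a = {s for s, r in ranks_a.items() if r > threshold}
    important_b = {s for s, r in ranks_b.items() if r > threshold}
    return {
        "only_a": important_a - important_b,
        "only_b": important_b - important_a,
        "both": important_a & important_b,
    }

# Made-up scores for placeholder stations.
good = {"Station A": 5.1, "Station B": 3.4, "Station C": 1.2}
bad = {"Station A": 4.8, "Station B": 2.9, "Station C": 3.3}
result = compare_important(good, bad)
```

The counts reported in the findings above are consistent with this decomposition: 45 - 11 = 34 and 41 - 7 = 34, so both lists imply the same 34 shared stations.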

NEED TO DO: Observations of good weather and bad weather important stations:

Compare most important stations March 2019 - Feb 2020 vs. March 2020 - Feb 2021

Which stations are different from pre-covid, vs. covid?

Save the dataframes with most important pre-covid and most important covid:

Findings on Pre-COVID vs. COVID

Pre-COVID:

  1. 14 stations had a pagerank greater than 3
    1. 6 of the most important (MI) pre-COVID stations were not MI during COVID
    2. Pre-COVID max page rank was 6.97
    3. Top 5 stations were:
      1. Front St & Washington St
      2. N 6 St & Bedford Ave
      3. S 3 St & Bedford Ave
      4. N 12 St & Bedford Ave
      5. E 17 St & Broadway

During COVID:

  1. 63 stations had a pagerank greater than 3.
    1. 55 of the MI COVID stations were not MI in the pre-COVID data set.
    2. COVID max page rank was 10.71
    3. Top 5 stations were:
      1. 1 Ave & E 68 St
      2. E 13 St & Avenue A
      3. Front St & Washington St
      4. Broadway & W 60 St
      5. W 21 St & 6 Ave

Map the Unique to Pre-COVID

Observation: two of the six unique-to-pre-COVID high-rank bike stations (33%) are directly at a subway station.

Map the Unique to COVID

Observation: 14 of the 55 unique-to-COVID high-rank bike stations (25%) are directly at a subway station.

For April 2018 - Oct 15, 2018, in the Evening and Night time bins, what is the rank of the Citi Bike station nearest Nacho Mama's restaurant?

In 2018, the closest Citi Bike station to Nacho Mama's was Amsterdam Ave & Morningside Dr, which belongs to Bike Behavior Group 0. Group 0 appears to have these characteristics in comparison to other groups:

Bike Behavior Group 0 has a median rank of 0.4389, and the Amsterdam Ave & Morningside Dr station has a page rank of 0.426.

Observations:

While reputable sources have reported that Nacho Mama's was a great restaurant, its Citi Bike station does not appear to have a significantly higher rank than the rest of its group.
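
As a quick check on the comparison above, the relative difference between the station's rank and its group's median can be computed directly from the two figures in the text. The variable names are illustrative.

```python
group_median = 0.4389  # Bike Behavior Group 0 median rank (from the text)
station_rank = 0.426   # Amsterdam Ave & Morningside Dr page rank (from the text)

relative_diff = (station_rank - group_median) / group_median
print(f"{relative_diff:+.1%}")  # -2.9%
```

The station sits about 3% below its group's median, so its rank is not elevated relative to the group.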